8 research outputs found

    Low-Memory Techniques for Routing and Fault-Tolerance on the Fat-Tree Topology

    PC clusters are currently considered an efficient alternative for building supercomputers in which thousands of computing nodes are connected through an interconnection network. The interconnection network must be designed carefully, since it has a great influence on overall system performance. Two of the main design parameters of interconnection networks are topology and routing. The topology defines how the network elements are connected to each other and to the computing nodes. Routing, in turn, defines the paths that packets follow through the network. Performance has traditionally been the main metric for evaluating interconnection networks. Nowadays, however, two additional metrics must be considered: cost and fault tolerance. Interconnection networks must scale not only in performance but also in cost. That is, they must maintain their throughput as the network size increases, and do so without an excessive increase in cost. Moreover, as the number of nodes in cluster machines increases, the interconnection network must grow accordingly. This increase in the number of network elements raises the probability of faults, so fault tolerance is practically mandatory for current interconnection networks. This thesis focuses on the fat-tree topology, since it is one of the topologies most commonly used in clusters. The aim of this thesis is to exploit its particular characteristics to provide fault tolerance and a routing algorithm capable of balancing the network load while offering a good trade-off between performance and cost.
    Gómez Requena, C. (2010). Low-Memory Techniques for Routing and Fault-Tolerance on the Fat-Tree Topology [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8856

    Bringing Real Processors to Labs

    This is the accepted version of the following article: Gómez, C., Gómez, M. E. and Sahuquillo, J. (2015), Bringing real processors to labs. Comput Appl Eng Educ, 23: 724–732, which has been published in final form at http://dx.doi.org/10.1002/cae.21645
    The architecture of current processors has changed greatly in recent years, leading to sophisticated multithreaded multicore processors. The inherent complexity of such processors makes it difficult to update processor teaching to include current commercial products, especially in lab sessions, where simplistic simulators are usually used. However, instructors must narrow this gap if they want to prepare students properly in this topic. Dealing with these complex concepts in labs not only helps reinforce theoretical concepts but also has a positive effect on students' motivation. This article presents a methodology designed for studying current microprocessor mechanisms gradually, without overwhelming students. The methodology is based on a detailed simulation framework, used in both academia and industry, which accurately models features of current processors. Because the simulator is highly complex, it is introduced through several learning phases. Qualitative and quantitative results demonstrate that students are able to develop skills with a detailed simulator in a reasonable time period and, at the same time, learn the details of complex architectural mechanisms of commercial microprocessors.
    Contract grant sponsor: Spanish Government; Contract grant number: TIN2012-38341-C04-01
    Gómez Requena, C.; Gómez Requena, ME.; Sahuquillo Borrás, J. (2015). Bringing Real Processors to Labs. Computer Applications in Engineering Education. 23(5):724-732. https://doi.org/10.1002/cae.21645

    XOR-based HoL-blocking Reduction Routing Mechanisms for Direct Networks

    [EN] Routing is a key design parameter in the interconnection network of large parallel computers. Routing algorithms are classified into two categories depending on the number of routing options available for each source-destination pair: deterministic (one path is available) and adaptive (several paths are available). Adaptive routing has two opposing effects on network performance. On the one hand, it provides routing flexibility that may help avoid a congested network area, thus improving network performance. On the other hand, it may also increase the Head-of-Line (HoL) blocking effect, because more destination nodes share the port queues. Usually, adaptive routing uses virtual channels to provide routing flexibility and to guarantee deadlock freedom. Deterministic routing is simpler, which implies a lower routing delay, and it introduces less Head-of-Line blocking. In this paper, we propose an adaptive, HoL-blocking reduction routing algorithm for direct topologies that tries to combine the good properties of both worlds: it provides routing flexibility but also reduces the Head-of-Line blocking effect. To do that, this paper proposes several functions that use the XOR operation to efficiently distribute packets among virtual channels based on their destination node. The resulting routing mechanisms have different properties depending on whether they favor routing flexibility or Head-of-Line blocking reduction.
    This work was supported by the Spanish Ministerio de Economia y Competitividad (MINECO) and by FEDER funds under Grant TIN2015-66972-05-1-R and by Programa de Ayudas de Investigacion y Desarrollo (PAID) from Universitat Politecnica de Valencia.
    Peñaranda Cebrián, R.; Gómez Requena, C.; Gómez Requena, ME.; López Rodríguez, PJ. (2017). XOR-based HoL-blocking Reduction Routing Mechanisms for Direct Networks. Parallel Computing. 67:57-74. https://doi.org/10.1016/j.parco.2017.06.004
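    The abstract does not give the XOR functions themselves, so the following Python sketch is only an illustration of the general idea of mapping a destination node to a virtual channel with XOR. The names `xor_fold` and `select_virtual_channel`, the chunk-folding scheme, and the power-of-two restriction on the number of virtual channels are all assumptions made here, not details from the paper.

```python
def xor_fold(value: int, bits: int) -> int:
    """XOR together successive `bits`-wide chunks of a non-negative integer."""
    mask = (1 << bits) - 1
    folded = 0
    while value:
        folded ^= value & mask   # mix the lowest chunk into the result
        value >>= bits           # move on to the next chunk
    return folded


def select_virtual_channel(dest: int, num_vcs: int) -> int:
    """Map a destination node id to a virtual channel index (hypothetical).

    Assumes num_vcs is a power of two, so the folded value is always a
    valid channel index in range(num_vcs).
    """
    if dest < 0:
        raise ValueError("dest must be non-negative")
    if num_vcs < 1 or num_vcs & (num_vcs - 1):
        raise ValueError("num_vcs must be a power of two")
    bits = num_vcs.bit_length() - 1
    if bits == 0:
        return 0                 # a single channel: nothing to choose
    return xor_fold(dest, bits)
```

    Under these assumptions, all packets for a given destination use the same virtual channel, and the destinations are spread evenly over the channels, which is the kind of property a destination-based HoL-blocking reduction scheme relies on.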

    Speeding-up the fault-tolerance analysis of interconnection networks

    © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
    Analyzing the fault-tolerance of interconnection networks implies checking the connectivity of each source-destination pair. The size of the exploration space of this operation skyrockets with the network size and with the number of link faults. However, the problem is highly parallelizable, since the exploration of each path between a source-destination pair is independent of the other paths. This paper presents an approach that uses GPUs to speed up the analysis of the fault-tolerance degree of multistage interconnection networks. The approach uses CUDA as the parallel programming tool on a GPU in order to take advantage of all available cores. Results show that the execution time of the fault-tolerance exploration can be significantly reduced.
    This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01.
    Bermúdez Garzón, DF.; Gómez Requena, C.; López Rodríguez, PJ.; Gómez Requena, ME. (2015). Speeding-up the fault-tolerance analysis of interconnection networks. IEEE. https://doi.org/10.1109/HPCSim.2015.7237035
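    To make the connectivity check concrete, here is a small sequential Python sketch. It is not the paper's CUDA implementation. It only shows the underlying test (are all endpoints still mutually reachable for a given set of faulty links?) and why the search space grows quickly with the number of faults. The graph representation and the function names `reachable`, `all_pairs_connected`, and `tolerates` are assumptions made for illustration.

```python
from collections import deque
from itertools import combinations


def reachable(adjacency, source, faulty_links):
    """Breadth-first search from `source` that skips faulty links."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if frozenset((node, neighbor)) in faulty_links:
                continue             # this link has failed
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def all_pairs_connected(adjacency, endpoints, faulty_links):
    """True if every endpoint can still reach every other endpoint."""
    faulty = {frozenset(link) for link in faulty_links}
    targets = set(endpoints)
    # Each source is checked independently of the others, which is what
    # makes the analysis easy to run in parallel.
    return all(targets <= reachable(adjacency, s, faulty) for s in targets)


def tolerates(adjacency, endpoints, num_faults):
    """True if connectivity survives every combination of `num_faults` link faults."""
    links = {frozenset((a, b)) for a in adjacency for b in adjacency[a]}
    # The number of combinations grows combinatorially with num_faults.
    return all(
        all_pairs_connected(adjacency, endpoints, combo)
        for combo in combinations(links, num_faults)
    )
```

    For example, on a four-node ring `{0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}`, `tolerates(ring, [0, 1, 2, 3], 1)` is `True`, while `tolerates(ring, [0, 1, 2, 3], 2)` is `False`, because two faults can isolate a node.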

    The k-ary n-direct s-indirect family of topologies for large-scale interconnection networks

    The final publication is available at Springer via http://dx.doi.org/10.1007/s11227-016-1640-z
    In large-scale supercomputers, the interconnection network plays a key role in system performance. Network topology largely determines the performance and cost of the interconnection network. Direct topologies are sometimes used because of their reduced hardware cost, but the number of network dimensions is limited by the physical 3D space, which leads to increased communication latency and reduced network throughput for large machines. Indirect topologies can provide better performance for large machines, but at a higher hardware cost. In this paper, we propose a new family of hybrid topologies, the k-ary n-direct s-indirect, that combines the best features of both direct and indirect topologies to efficiently connect an extremely high number of processing nodes. The proposed network is an n-dimensional topology where the k nodes of each dimension are connected through a small indirect topology of s stages. This combination results in a family of topologies that provides high performance, with latency and throughput figures close to those of indirect topologies, but at a lower hardware cost. In particular, it doubles the throughput obtained per cost unit compared with indirect topologies in most cases. Moreover, their fault-tolerance degree is similar to the one achieved by direct topologies built with switches with the same number of ports.
    This work was supported by the Spanish Ministerio de Economia y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01 and by Programa de Ayudas de Investigacion y Desarrollo (PAID) from Universitat Politecnica de Valencia.
    Peñaranda Cebrián, R.; Gómez Requena, C.; Gómez Requena, ME.; López Rodríguez, PJ.; Duato Marín, JF. (2016). The k-ary n-direct s-indirect family of topologies for large-scale interconnection networks. Journal of Supercomputing. 72(3):1035-1062. https://doi.org/10.1007/s11227-016-1640-z

    A Family of Fault-Tolerant Efficient Indirect Topologies

    © 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
    On the one hand, performance and fault-tolerance of interconnection networks are key design issues for high performance computing (HPC) systems. On the other hand, cost should also be considered. Indirect topologies are often chosen in the design of HPC systems. Among them, the most commonly used topology is the fat-tree. In this work, we focus on getting the maximum benefit from the network resources by designing a simple indirect topology with very good performance and fault-tolerance properties, while keeping the hardware cost as low as possible. To do that, we propose some extensions to the fat-tree topology to take full advantage of the hardware resources consumed by the topology. In particular, we propose three new topologies with different properties in terms of cost, performance and fault-tolerance. All of them achieve similar or better performance than the fat-tree while also providing a good level of fault-tolerance. Contrary to most of the available topologies, these proposals can also tolerate faults in the links that connect to end nodes.
    This work was supported by the Spanish Ministerio de Economia y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01.
    Bermúdez Garzón, DF.; Gómez Requena, C.; Gómez Requena, ME.; López Rodríguez, PJ.; Duato Marín, JF. (2016). A Family of Fault-Tolerant Efficient Indirect Topologies. IEEE Transactions on Parallel and Distributed Systems. 27(4):927-940. https://doi.org/10.1109/TPDS.2015.2430863

    Efficient Selective Multicore Prefetching under Limited Memory Bandwidth

    [EN] Current multicore systems implement multiple hardware prefetchers to tolerate long main memory latencies. However, memory bandwidth is a scarce shared resource that becomes critical as the core count increases. To deal with this, recent works have focused on adaptive prefetchers, which control prefetcher aggressiveness to regulate main memory bandwidth consumption. Nevertheless, in limited-bandwidth machines or under memory-hungry workloads, keeping the prefetcher active can damage system performance and increase energy consumption. This paper introduces selective prefetching, where individual prefetchers are activated or deactivated to improve both main memory energy and performance, and proposes ADP, a prefetcher that deactivates local prefetchers in some cores when they present low performance and co-runners need additional bandwidth. Based on heuristics, an individual prefetcher is reactivated when performance enhancements are foreseen. Compared to a state-of-the-art adaptive prefetcher, ADP provides both performance and energy enhancements under limited memory bandwidth. (C) 2018 Elsevier Inc. All rights reserved.
    Selfa-Oliver, V.; Sahuquillo Borrás, J.; Gómez Requena, ME.; Gómez Requena, C. (2018). Efficient Selective Multicore Prefetching under Limited Memory Bandwidth. Journal of Parallel and Distributed Computing. 120:32-43. https://doi.org/10.1016/j.jpdc.2018.05.002

    A HoL-blocking aware mechanism for selecting the upward path in fat-tree topologies

    The final publication is available at Springer via http://link.springer.com/article/10.1007%2Fs11227-014-1303-x
    Large cluster-based machines require efficient high-performance interconnection networks. Routing is a key design issue of interconnection networks. Adaptive routing usually outperforms deterministic routing, at the expense of introducing out-of-order packet delivery. Many of the commodity interconnects for clusters are based on fat-trees. The adaptive routing algorithm commonly used in fat-trees is composed of a fully adaptive upward subpath followed by a deterministic downward subpath. As the latter is determined by the former, choosing the most adequate upward path for each packet is critical to achieving good performance in fat-trees. In this paper, we present a mechanism for selecting the upward path in fat-trees, which enables optimum use of the available network resources to achieve high network throughput. The proposed path selection is destination based, which reduces the head-of-line blocking effect. The proposed mechanism can be used either as a selection function (the provided path is used as the preferred one) or as a deterministic routing algorithm (the path is the only possible one). The results show that the resulting selection function outperforms any other known one. Moreover, the proposed deterministic routing algorithm can achieve a similar, or even higher, level of performance than adaptive routing, while providing in-order packet delivery and a simpler switch implementation.
    This work was supported by the Spanish Ministerio de Ciencia e Innovacion (MICINN) and jointly financed with Plan E funds, under Grant TIN2009-14475-C04 as well as by Consolider-Ingenio 2010 under Grant CSD2006-00046.
    Gómez Requena, C.; Gilabert Villamón, F.; Gómez Requena, ME.; López Rodríguez, PJ.; Duato Marín, JF. (2015). A HoL-blocking aware mechanism for selecting the upward path in fat-tree topologies. Journal of Supercomputing. 71(7):2339-2364. https://doi.org/10.1007/s11227-014-1303-x
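    The abstract says the upward path is chosen from the destination but does not spell out the rule. The Python sketch below shows one simple destination-based rule for a k-ary fat-tree, in which the switch at a given stage picks the upward port from the corresponding base-k digit of the destination id. Whether this matches the paper's mechanism is an assumption; the function name `upward_port` and the stage numbering (0 at the leaf switches) are chosen here for illustration.

```python
def upward_port(dest: int, stage: int, k: int) -> int:
    """Pick an upward port from the base-k digit of `dest` at position `stage`.

    Hypothetical rule: stage 0 is the leaf stage, and every switch has k
    upward ports numbered 0..k-1.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    if dest < 0 or stage < 0:
        raise ValueError("dest and stage must be non-negative")
    return (dest // k ** stage) % k


# Example: with k = 4, destination 27 is 1*16 + 2*4 + 3, so its digits
# from the leaf stage upward are 3, 2, 1.
ports = [upward_port(27, stage, 4) for stage in range(3)]
```

    Because the port depends only on the destination and the stage, all packets for one destination follow the same upward path, which gives in-order delivery when the rule is used as a deterministic routing function. Used as a selection function, the same port would only be the preferred choice among the adaptive options.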